# loading data
kang_seed <- read.table(here("data/sev208_kratseedbank_20120213.txt"), sep=',', header = TRUE) |>
mutate(mnd = as.factor(mnd)) |>
filter(!(species %in% c("soil", "plant", "dist", "litter", "gravel")))
gg_miss_var(kang_seed)
skim(kang_seed)
| Name | kang_seed |
| Number of rows | 960 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| dir | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| loc | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| species | 0 | 1 | 4 | 6 | 0 | 8 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| mnd | 0 | 1 | FALSE | 10 | 18: 128, 23: 128, 25: 128, 6: 96 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| seeds | 0 | 1 | 11.1 | 48.21 | 0 | 0 | 0 | 3 | 656 | ▇▁▁▁▁ |
# visualize counts
ggplot(kang_seed, aes(x = seeds)) +
geom_histogram(bins = 17)
#visualize using box plot
ggplot(data = kang_seed, aes(x = mnd, y = seeds)) +
geom_boxplot() +
labs(x = "Mound Number", y = "NUmber of Seeds", title = "Boxplot of Seeds Found at Different Mounds",
caption = "Figure 1: ") +
theme_bw()
# set up data for bar graph
kang_seed_bar <- kang_seed |>
group_by(mnd) |>
summarise(across(seeds, sum)) |>
ungroup()
# bar graph
ggplot() +
geom_bar(data = kang_seed_bar, aes(x = mnd, y = seeds, fill = mnd),
stat = "identity", show.legend = F) +
theme_bw() +
labs(x = "Mound Number", y = "Number of Seeds",
title = "Total Seed Count per Mound", caption = )
Question: How does total seed number differ between kangaroo rat mound locations?
We will start by attempting to run an ANOVA test to determine if there is a difference in the total seed count between kangaroo rat mound locations.
\(H_0:\) There is no significant
difference in seed counts between the kangaroo rat mound
locations.
\(H_a:\) There is a significant
difference in seed counts between at least two of the kangaroo mound
locations.
# make aov object
kang_aov <- aov(seeds ~ mnd, data = kang_seed)
# get resiudals
kang_res <- kang_aov$residuals
# check normality
qqPlot(kang_res)
## [1] 899 195
The data does not appear to be normal, which we expected because in the context reading they discussed how in their analysis the data was non-normal even after trying transformations. We decided to try a log transformation regardless because this would be the logical next step in this analysis.
# transform data
kang_seed_log <- kang_seed |>
mutate(seeds = log(seeds + 1))
# make new aov object
kang_log_aov <- aov(seeds ~ mnd, data = kang_seed_log)
# residuals
kang_log_res <- kang_log_aov$residuals
#test normality
qqPlot(kang_log_res)
## [1] 899 195
The data is still non-normal as expected (we stated above that the context paper mentioned how the data is not normal after all transformations).
Due to this we are going to run a Kruskal-Wallis test, for which we meet all assumptions (there are categorical predictor variables, there are independent samples, and each group has at least five observations) and works with count data. This will change our null and alternative hypothesis to be:
\(H_0:\) There is no difference in
the median total seed number between the kangaroo rat mound
locations.
\(H_a:\) There is a difference in at
least two kangaroo rat mound locations median total seed number.
# show aov summary
kruskal.test(seeds ~ mnd, data = kang_seed)
##
## Kruskal-Wallis rank sum test
##
## data: seeds by mnd
## Kruskal-Wallis chi-squared = 16.997, df = 9, p-value = 0.04876
After running the Kruskal-Wallis test, we reject the null hypothesis. (See results).
# loading data
shrub_raw <- read.csv(here("data/shrubstudy_seed.csv"))
shrub_seed <- shrub_raw |>
dplyr::select(treatment, species, total_inf = total_nr_infl,
num_seeds = nr_seeds, shrub_num, plant_num = plant_nr, tag_num) |>
mutate(treatment = as.factor(treatment))
# visualize missing data
gg_miss_var(shrub_seed)
skim(shrub_seed)
| Name | shrub_seed |
| Number of rows | 287 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| species | 0 | 1 | 6 | 6 | 0 | 19 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| treatment | 0 | 1 | FALSE | 2 | con: 174, shr: 113 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| total_inf | 0 | 1.00 | 5.15 | 10.45 | 1 | 1.00 | 1 | 5.00 | 117 | ▇▁▁▁▁ |
| num_seeds | 105 | 0.63 | 14.55 | 28.62 | 0 | 1.25 | 5 | 13.75 | 285 | ▇▁▁▁▁ |
| shrub_num | 0 | 1.00 | 29.53 | 16.04 | 5 | 13.00 | 30 | 44.00 | 54 | ▆▁▇▂▅ |
| plant_num | 0 | 1.00 | 2.43 | 1.34 | 1 | 1.00 | 2 | 3.00 | 5 | ▇▆▅▃▂ |
| tag_num | 0 | 1.00 | 143.48 | 63.71 | 22 | 104.00 | 159 | 181.00 | 300 | ▃▂▇▂▁ |
# filter out missing data
shrub_seed_sub <- shrub_seed |>
drop_na(num_seeds)
ggplot(shrub_seed_sub, aes(x = num_seeds)) +
geom_histogram(bins = 17)
ggplot(shrub_seed_sub, aes(x = species)) +
geom_bar(stat = 'count')
One thing to note from this plot is that TRIDAS appears to only have one sample. Because of this, in the future, the predictions are likely to be all over the place and have a very large confidence interval. In order to increase the interpretability of the following plots, we will remove TRIDAS.
# filter out missing data
shrub_seed_sub <- shrub_seed_sub |>
subset(species != "TRIDAS")
shrub_seed_sub |>
dplyr::select(!num_seeds) |>
ggpairs()
Question: How does seed count vary with plot type (shrub or open), plant species, and total number of inflorescences? Is there a simpler model that explains seed count, and if so, what is it?
# linear model, we know this is wrong
shrub_model1 <- lm(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub)
# generalized linear model with Poisson distribution
shrub_model2 <- glm(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub, family = "poisson")
# generalized linear model with negative binomial distribution
shrub_model3 <- glmmTMB(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub, family = "nbinom2")
# generalized linear model with Poisson distribution and random effect of treatment
shrub_model4 <- glmmTMB(num_seeds ~ treatment + species + total_inf + (1|shrub_num),
data = shrub_seed_sub, family = "poisson")
# generalized linear model with negative binomial distribution and random effect of treatment
shrub_model5 <- glmmTMB(num_seeds ~ treatment + species + total_inf + (1|shrub_num),
data = shrub_seed_sub, family = "nbinom2")
Because we are looking at count data, we know that the data is discrete and only has a lower bound. Knowing this, we built a couple different models using the Poisson and Negative Binomial distribution.
# check diagnostics
simulationOutput_m1 <- simulateResiduals(shrub_model1)
plot(simulationOutput_m1)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m1)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.96669, p-value = 0.744
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m1)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = Inf, p-value < 2.2e-16
## alternative hypothesis: two.sided
# check diagnostics
simulationOutput_m2 <- simulateResiduals(shrub_model2)
plot(simulationOutput_m2)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m2)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 21.719, p-value < 2.2e-16
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m2)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 3.992, p-value < 2.2e-16
## alternative hypothesis: two.sided
simulationOutput_m3 <- simulateResiduals(shrub_model3)
plot(simulationOutput_m3)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m3)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.4263, p-value = 0.24
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m3)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.4202, p-value = 0.024
## alternative hypothesis: two.sided
# check diagnostics
simulationOutput_m4 <- simulateResiduals(shrub_model4)
plot(simulationOutput_m4)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m4)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.83908, p-value = 0.464
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m4)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 3.1189, p-value < 2.2e-16
## alternative hypothesis: two.sided
# check diagnostics
simulationOutput_m5 <- simulateResiduals(shrub_model5)
plot(simulationOutput_m5)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m5)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.5972, p-value = 0.24
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m5)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.4281, p-value = 0.048
## alternative hypothesis: two.sided
| Model | Formula | Distribution | QQ Plot Residuals | Residual vs. Predicted | Over/Under Dispersion | Zeros |
|---|---|---|---|---|---|---|
| shrub_model1 | num_seeds ~ treatment + species + total_inf | Normal | F | F | Under | Too many |
| shrub_model2 | num_seeds ~ treatment + species + total_inf | Poisson | F | F | Over | Too many |
| shrub_model3 | num_seeds ~ treatment + species + total_inf | Negative Binomial | P | F | None | Too many |
| shrub_model4 | num_seeds ~ treatment + species + total_inf + (1|shrub_num) | Poisson | F | F | Over | Too many |
| shrub_model5 | num_seeds ~ treatment + species + total_inf + (1|shrub_num) | Negative Binomial | P | F | None | Too many |
Based on the tests above, it appears that either shrub_model3 or shrub_model5 will be the best models because neither are over or under-dispered. However, they do still have some patterns in the residuals, and they appear to be zero-inflated. To further our model, we will create a new model with a zero-inflated term.
# generalized linear model with negative binomial distribution
shrub_model6 <- glmmTMB(num_seeds ~ treatment + species + total_inf, ziformula = ~1,
data = shrub_seed_sub, family = "nbinom2")
With the zero-inflated model created, we will now run the necessary diagnostics.
# check diagnostics
simulationOutput_m6 <- simulateResiduals(shrub_model6)
plot(simulationOutput_m6)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m6)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.1974, p-value = 0.472
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m6)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.0479, p-value = 0.76
## alternative hypothesis: two.sided
Based on the above plots, shrub_model6 appears to be our best model. There is no over or underdispersion, and our distribution has about the same number of zeros as our model would predict. There is still some trend in our residuals, but it appears to have been lessened when compared to shrub_model3. We acknowledge this trend in residuals and have decided that isn’t a big cause for concern since the other diagnostics were met.
Moving forward, we will check AICC values to test whether our assumptions about the models are correct.
MuMIn::model.sel(shrub_model1, shrub_model2, shrub_model3, shrub_model4, shrub_model5, shrub_model6)
## Model selection table
## (Intr) spc ttl_inf trt cnd((Int)) dsp((Int)) cnd(spc) cnd(ttl_inf)
## shrub_model6 2.156 + + 0.06634
## shrub_model3 1.917 + + 0.07502
## shrub_model5 1.915 + + 0.07505
## shrub_model1 -2.567 + 2.15000 +
## shrub_model4 2.488 + + 0.03145
## shrub_model2 2.549 + 0.02989 +
## cnd(trt) zi((Int)) family class random df logLik AICc
## shrub_model6 + -2.283 n2(lg) glmmTMB 9 -548.867 1116.8
## shrub_model3 + n2(lg) glmmTMB 8 -552.052 1120.9
## shrub_model5 + n2(lg) glmmTMB c(s_n) 9 -551.884 1122.8
## shrub_model1 gs(id) lm 8 -689.162 1395.2
## shrub_model4 + ps(lg) glmmTMB c(s_n) 8 -1055.923 2128.7
## shrub_model2 ps(lg) glm 7 -1147.380 2309.4
## delta weight
## shrub_model6 0.00 0.852
## shrub_model3 4.15 0.107
## shrub_model5 6.03 0.042
## shrub_model1 278.37 0.000
## shrub_model4 1011.90 0.000
## shrub_model2 1192.62 0.000
## Abbreviations:
## family: gs(id) = 'gaussian(identity)', n2(lg) = 'nbinom2(log)',
## ps(lg) = 'poisson(log)'
## Models ranked by AICc(x)
## Random terms:
## c(s_n): cond(1 | shrub_num)
This chart confirms our assumptions about our models; shrub_model6 is best, followed by shrub_model3 and shrub_model5. We know this because of their respective AICC values of 1127.6, 1132.0, and 1133.9. In general, a lower AICC value equates to a better model.
Next, we will look closer at our model: shrub_model6.
# summary
summary(shrub_model6)
## Family: nbinom2 ( log )
## Formula: num_seeds ~ treatment + species + total_inf
## Zero inflation: ~1
## Data: shrub_seed_sub
##
## AIC BIC logLik deviance df.resid
## 1115.7 1144.5 -548.9 1097.7 172
##
##
## Dispersion parameter for nbinom2 family (): 2.18
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.155988 0.212883 10.128 < 2e-16 ***
## treatmentshrub -0.299250 0.131830 -2.270 0.0232 *
## speciesCARRUP -1.718777 0.281984 -6.095 1.09e-09 ***
## speciesGEUROS -0.238970 0.242542 -0.985 0.3245
## speciesKOBMYO 0.046928 0.201021 0.233 0.8154
## speciesMINOBT -0.416711 0.212101 -1.965 0.0495 *
## total_inf 0.066343 0.007516 8.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2827 0.3991 -5.719 1.07e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary of our model shows that treatmentshrub, speciesCARRUP, speciesMINOBT, speciesTRIDAS, and total_inf all appears to be significant indicators of total number of seeds.
# confidence intervals
confint(shrub_model6)
## 2.5 % 97.5 % Estimate
## cond.(Intercept) 1.7387449 2.573230737 2.15598779
## cond.treatmentshrub -0.5576313 -0.040868234 -0.29924976
## cond.speciesCARRUP -2.2714556 -1.166099126 -1.71877737
## cond.speciesGEUROS -0.7143430 0.236403207 -0.23896992
## cond.speciesKOBMYO -0.3470654 0.440921896 0.04692825
## cond.speciesMINOBT -0.8324210 -0.001001714 -0.41671134
## cond.total_inf 0.0516118 0.081073969 0.06634289
## zi.(Intercept) -3.0649895 -1.500469125 -2.28272929
# adjusted R2
r.squaredGLMM(shrub_model6)
## R2m R2c
## delta 0.7225478 0.7225478
## lognormal 0.7643760 0.7643760
## trigamma 0.6655456 0.6655456
# model object in table
shrub_model6 |>
as_flextable()
Estimate | Standard Error | statistic | p-value | ||||
|---|---|---|---|---|---|---|---|
Fixed effects | |||||||
(Intercept) | 2.156 | 0.213 | 10.128 | 0.0000 | *** | ||
treatmentshrub | -0.299 | 0.132 | -2.270 | 0.0232 | * | ||
speciesCARRUP | -1.719 | 0.282 | -6.095 | 0.0000 | *** | ||
speciesGEUROS | -0.239 | 0.243 | -0.985 | 0.3245 |
| ||
speciesKOBMYO | 0.047 | 0.201 | 0.233 | 0.8154 |
| ||
speciesMINOBT | -0.417 | 0.212 | -1.965 | 0.0495 | * | ||
total_inf | 0.066 | 0.008 | 8.827 | 0.0000 | *** | ||
(Intercept) | -2.283 | 0.399 | -5.719 | 0.0000 | *** | ||
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||||
square root of the estimated residual variance: 2.2 | |||||||
data's log-likelihood under the model: -548.9 | |||||||
Akaike Information Criterion: 1,115.7 | |||||||
Bayesian Information Criterion: 1,144.5 | |||||||
The above displays what we studied earlier in the summary, just in an easier to digest format.
# Compute predictions
predictions <- ggpredict(shrub_model6, terms = c("treatment", "species", "total_inf"))
# Plot showing seed differences by Species in the ACTUAL DATA
ggplot(shrub_seed_sub, aes(x = treatment, y = num_seeds, fill = treatment)) +
geom_boxplot(aes(color = treatment), alpha = 0.5) +
facet_wrap(~ species) +
scale_fill_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
scale_color_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
theme_bw() +
labs(
x = "Species",
y = "Number of Seeds",
title = "Comparing Measured Number of Seeds per Species"
) +
coord_cartesian(ylim = c(0, 70))
# Predictions
pred <- ggpredict(shrub_model6, terms = "species", back.transform = TRUE)
# Plot showing seed differences by Species
plot(pred) +
labs(
x = "Species",
y = "Number of Seeds",
title = "Comparing Estimated Number of Seeds per Species"
)
# Viz showing measured seed difference by plot type
plottype1 <- ggplot(shrub_seed_sub, aes(x = treatment, y = num_seeds, fill = treatment)) +
geom_boxplot(aes(color = treatment), alpha = 0.5) +
scale_fill_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
scale_color_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
theme_bw() +
labs(
x = "Species",
y = "Number of Seeds",
title = "Comparing Measured Number of Seeds per Species"
) +
coord_cartesian(ylim = c(0, 40)) +
theme(legend.position = "none")
# Predictions
pred2 <- ggpredict(shrub_model6, terms = "treatment", back.transform = TRUE)
# Viz showing estimated seed difference by plot type
plottype2 <- plot(pred2) +
labs(
x = "Plot Type",
y = "Number of Seeds",
title = "Comparing Estimated Number of Seeds per Plot Type"
) +
coord_cartesian(ylim = c(0, 40))
# making viz side by side
plottype_fig <- ggarrange(plottype1 + rremove("xlab") + rremove('ylab'),
plottype2 + rremove("xlab") + rremove('ylab'),
ncol = 2,
labels = c('A.', 'B.', 'C.', 'D.', 'E.'), # adding figure labels in top left of plot
vjust = 5, # adjusting figure label position
hjust = -4)
annotate_figure(plottype_fig,
left = textGrob("Number of Seeds", # add universal y-axis
rot = 90, vjust = 0.5, # styling
gp = gpar(cex = 1.3)),
bottom = textGrob("Plot Type",
gp = gpar(cex = 1.3))) # handle graphical parameters
inf1 <- ggplot(shrub_seed_sub, aes(x = total_inf, y = num_seeds)) +
geom_point() +
theme_bw() +
labs(
x = "Number of inflorescences",
y = "Number of Seeds",
title = "Comparing Measured Number of Seeds to Number of inflorescences"
) +
coord_cartesian(xlim = c(0, 60),
ylim = c(0, 500))
# Predictions
pred3 <- ggpredict(shrub_model6, terms = "total_inf", back.transform = TRUE)
# Plot showing seed differences by Species
inf2 <- plot(pred3) +
labs(
x = "Number of inflorescences",
y = "Number of Seeds",
title = "Comparing Estimated Number of Seeds to Number of inflorescences"
) +
coord_cartesian(xlim = c(0, 60),
ylim = c(0, 500))
# making viz side by side
inf_fig <- ggarrange(inf1 + rremove("xlab") + rremove('ylab'),
inf2 + rremove("xlab") + rremove('ylab'),
ncol = 2,
labels = c('A.', 'B.', 'C.', 'D.', 'E.'), # adding figure labels in top left of plot
vjust = 5, # adjusting figure label position
hjust = -4)
annotate_figure(inf_fig,
left = textGrob("Number of Seeds", # add universal y-axis
rot = 90, vjust = 0.5, # styling
gp = gpar(cex = 1.3)),
bottom = textGrob("Number of inflorescences",
gp = gpar(cex = 1.3))) # handle graphical parameters
To answer the question we are studying, how does seed count vary with plot type (shrub or open), plant species, and total number of inflorescences, we need to look at the effects of each of the variables.
The first pair of plots show that different species result in a wide variety in the seed count. In general, our model tends to underestimate the number of seeds for all of the species. However, it does correctly predict which species result in a larger number of seeds or a smaller number of seeds, which we believe to be the more valuable insight.
The second pair of plots show that the number of seeds is generally higher in the open (control) plot. It does this well, correctly predicting that the open plot typically has a higher number of seeds compared to the shrub plot.
Lastly, the third pair of plots show that the number of seeds increases as the number of total inflorescences increases. Because of the lack of samples with high inflorescences, the confidence interval grows significantly on the latter part of the plot. However, where we have a lot of data points, the model predicts the number of seeds very well.
Now that we know how seed count varies with plot type, species, and total number of inflorescences, we will investigate to see if we can predict seed count with a simpler model.